Latin Etymologies as Features on BNC Text Categorization

Authors

  • Alex Chengyu Fang
  • Wanyin Li
  • Nancy Ide
Abstract

This paper presents early experimental work on text categorization (TC) of the British National Corpus (BNC) using Latin etymologies as features, with an emphasis on distinguishing spoken from written texts. The study has two aims: (1) to explore new, discriminative linguistic features in place of the noisy "bag-of-words" (BoW) representation; and (2) to lay the groundwork for representing texts with distinct types of linguistic features under different weighting schemes, rather than as a plain BoW feature vector. The experiments reveal notably distinct distribution patterns of Latin etymologies in spoken and written BNC texts. A custom-built classifier based on the probability distribution ranges of Latin etymologies reaches a precision of 72.31% and a recall of 73.22% on BNC spoken texts, and a precision of 73.31% and a recall of 69.98% on BNC written texts.
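The abstract does not describe the classifier in detail, so the following is only a minimal sketch of the general idea: measure the share of Latin-derived words in a text, assign a label from the range that share falls into, and score the result with precision and recall. The lexicon `LATIN_ORIGIN`, the single cut-off `threshold=0.15`, and the assumption that written texts carry the higher share are all hypothetical choices made for illustration, not details taken from the paper.

```python
from typing import Iterable, Sequence, Tuple

# Hypothetical toy lexicon of Latin-derived words. A real study would draw on
# an etymological dictionary; the abstract does not name one.
LATIN_ORIGIN = {"consider", "require", "information", "discuss", "obtain", "sufficient"}


def latin_ratio(tokens: Iterable[str]) -> float:
    """Return the fraction of tokens found in the Latin-origin lexicon."""
    lowered = [t.lower() for t in tokens]
    if not lowered:
        return 0.0
    return sum(t in LATIN_ORIGIN for t in lowered) / len(lowered)


def classify(tokens: Iterable[str], threshold: float = 0.15) -> str:
    """Label a text by its Latin ratio.

    Assumption: written texts have the higher ratio, and one cut-off
    (an arbitrary value here) separates the two ranges.
    """
    return "written" if latin_ratio(tokens) >= threshold else "spoken"


def precision_recall(gold: Sequence[str], predicted: Sequence[str], label: str) -> Tuple[float, float]:
    """Compute precision and recall for one label from paired gold/predicted labels."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == label and p == label for g, p in pairs)
    fp = sum(g != label and p == label for g, p in pairs)
    fn = sum(g == label and p != label for g, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

In practice the cut-off would be estimated from the observed distributions in labelled spoken and written texts rather than fixed by hand, and precision and recall would be reported separately for each class, as the abstract does.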

Similar resources

Improving the Operation of Text Categorization Systems with Selecting Proper Features Based on PSO-LA

With the explosive growth in the amount of information, tools and methods are needed to search, filter and manage resources. One of the major problems in text classification relates to high-dimensional feature spaces. Therefore, the main goal of text classification is to reduce the dimensionality of the feature space. There are many feature selection methods. However...

Corpus Linguistics with BNCweb - a Practical Guide

Book synopsis This book presents a richly illustrated, hands-on discussion of one of the fastest growing fields in linguistics today. The authors address key methodological issues in corpus linguistics, such as collocations, keywords and the categorization of concordance lines. They show how these topics can be explored step-by-step with BNCweb, a user-friendly web-based tool that supports soph...

Fuck revisited

This paper is a follow-up to the investigation of McEnery, Baker and Hardie (2000) into the use of the word fuck in spoken British English. Both that paper and this one are based on the British National Corpus. However, at the time of writing in 2000, the analysis of fuck in the written BNC had not been completed, hence the 2000 paper focussed on spoken English alone. In doing so, it explored the w...

Web as Corpus

The corpus resource for the 1990s was the BNC. Conceived in the 80s, completed in the mid 90s, it was hugely innovative and opened up myriad new research avenues for comparing different text types, sociolinguistics, empirical NLP, language teaching and lexicography. But now the web is with us, giving access to colossal quantities of text, of any number of varieties, at the click of a button, fo...

The Creation of a Spoken Sub-Corpus from the British National Corpus for Comparative Purposes

The British National Corpus (henceforth BNC) is one of the most frequently consulted corpora in linguistic research. While the use of this corpus is continuously on the increase, it appears that most BNC-related research work has exploited the corpus in its entirety, i.e. taking the corpus as a whole in analysing specific features or comparing with a different reference corpus. Despite the fact...

Publication date: 2009